Improving cross-document co-reference with semi-supervised information extraction models
Authors
Abstract
In this paper, we consider the problem of cross-document co-reference (CDC). Existing approaches tend to treat CDC as an information retrieval problem and use features such as TF-IDF cosine similarity to cluster documents and/or co-reference chains. We augment these features with features based on biographical attributes, such as occupation, nationality, and gender, obtained by using semi-supervised attribute extraction models. Our results suggest that adding these features boosts the performance of our CDC system considerably. Extracting such specific attributes also allows us to use features such as semantic similarity, mutual information, and approximate name similarity, which have not been used so far for CDC with traditional bag-of-words models. Our system achieves F0.5 scores of 0.82 and 0.81 on the WePS-1 and WePS-2 datasets, respectively, which rival the best reported scores for this problem.
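As a rough illustration of the TF-IDF cosine similarity baseline mentioned in the abstract, the following minimal sketch compares three short documents that all mention the same name. The example sentences, the function names `tfidf_vectors` and `cosine`, and the smoothed IDF weighting are assumptions made for illustration only; the abstract does not specify the exact weighting or the clustering procedure used in the paper.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build one sparse TF-IDF vector (term -> weight) per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vec = {}
        for term, count in counts.items():
            tf = count / len(doc)
            # Smoothed IDF: an assumed weighting, not taken from the paper.
            idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0
            vec[term] = tf * idf
        vectors.append(vec)
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical documents mentioning the same name, "john smith".
docs = [
    "john smith is a physicist at a university".split(),
    "john smith the physicist published a paper".split(),
    "john smith scored a goal for his football club".split(),
]
vecs = tfidf_vectors(docs)
same_person = cosine(vecs[0], vecs[1])
different_person = cosine(vecs[0], vecs[2])
print(same_person, different_person)
```

In this toy example the two documents that share the occupation term score higher than the pair that shares only the name and common words, which hints at why explicit biographical attributes could be useful as additional features.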
Similar resources
Semi-supervised Statistical Inference for Business Entities Extraction and Business Relations Discovery
The sheer volume of user-contributed data on the Internet has motivated organizations to explore collective business intelligence (BI) for improving business decision making. One common problem for BI extraction is to accurately identify the entities being referred to in user-contributed comments. Although named entity recognition (NER) tools are available to identify basic entities in tex...
Improving Semi-Supervised Acquisition Of Relation Extraction Patterns
This paper presents a novel approach to the semi-supervised learning of Information Extraction patterns. The method makes use of more complex patterns than previous approaches and determines their similarity using a measure inspired by recent work using kernel methods (Culotta and Sorensen, 2004). Experiments show that the proposed similarity measure outperforms a previously reported measure ba...
A New Method for Improving Computational Cost of Open Information Extraction Systems Using Log-Linear Model
Information extraction (IE) is the process of automatically producing a structured representation from unstructured or semi-structured text. It is a long-standing challenge in natural language processing (NLP) that has been intensified by the increasing volume and heterogeneity of information and by its unstructured form. One of the core information extraction tasks is relation extraction wh...
Identifying Cores of Semantic Classes in Unstructured Text with a Semi-supervised Learning Approach
Cores of semantic classes in scenario descriptions can be extremely valuable in question-answering, information extraction, and document retrieval. We propose a semi-supervised learning approach to automatically identify and classify cores of semantic classes in unstructured text. We perform a case study on medical text. The results show that the selected features characterize the cluster struc...